Run-Length Compressed Indexes for Repetitive Sequence Collections
Authors
Abstract
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, a suffix tree occupies O(N log N) bits, which very soon inhibits in-memory analyses. Recent advances in full-text indexing reduce the space of the suffix tree to N H_k + o(N log σ) bits, at the cost of the running times of its operations increasing by a polylog(N) factor. Here H_k is the k-th order entropy of the collection and σ is the alphabet size. Notice that for r identical copies of an incompressible base sequence, the bound simplifies to N log σ (1 + o(1)) bits. We develop new static/dynamic full-text self-indexes based on run-length encoding whose space requirements are much less dependent on N. For example, we obtain an index occupying R log σ (1 + o(1)) + R log(N/R) (1 + o(1)) + r log n + O((s + r) log(s + r)) bits, where s is the total number of basic edit operations needed to convert the r repeats into substrings of the base sequence, and R ≤ min(n, n H_k) + O((s + r) log_σ N), where the O() term holds in the expected case. The new indexes can be plugged into a recent dynamic fully-compressed suffix tree using an additional O((N/δ) log N) bits of space for any δ = polylog(N), while retaining the polylog(N) time slowdown on operations.
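To make the role of R concrete, here is a minimal Python sketch. It assumes that R counts maximal runs of equal symbols in the Burrows-Wheeler transform (BWT) of the collection; the abstract does not spell this out, so treat it as an illustrative reading and not as the paper's construction. The helpers `bwt`, `count_runs`, and `make_collection` are hypothetical names, and the naive rotation sort is only suitable for toy inputs.

```python
import random


def bwt(text: str) -> str:
    """Burrows-Wheeler transform by sorting all rotations (toy inputs only)."""
    if "$" in text:
        raise ValueError("'$' is reserved as the terminator")
    t = text + "$"  # '$' sorts before letters and digits in ASCII
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal symbols, e.g. 'aabccc' has 3 runs."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def make_collection(base: str, r: int, seed: int = 0) -> str:
    """Concatenate r copies of base, each with at most one random substitution."""
    rng = random.Random(seed)
    copies = []
    for _ in range(r):
        chars = list(base)
        pos = rng.randrange(len(chars))
        chars[pos] = rng.choice("ACGT")  # may re-pick the same symbol
        copies.append("".join(chars))
    return "".join(copies)


if __name__ == "__main__":
    rng = random.Random(1)
    base = "".join(rng.choice("ACGT") for _ in range(200))  # base length n = 200
    collection = make_collection(base, r=20)                # total length N = 4000

    # Same symbols in random order: a non-repetitive baseline of equal length.
    shuffled_chars = list(collection)
    rng.shuffle(shuffled_chars)
    shuffled = "".join(shuffled_chars)

    for name, text in [("repetitive", collection), ("shuffled", shuffled)]:
        print(name, "N =", len(text), "runs R =", count_runs(bwt(text)))
```

Running the script prints the run count for the near-identical copies next to that of a shuffled string of the same length, which gives a rough feel for why a space bound stated in terms of R can be far smaller than one stated in terms of N.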
Similar resources
Run-Length Compressed Indexes Are Superior for Highly Repetitive Sequence Collections
A repetitive sequence collection is one where portions of a base sequence of length n are repeated many times with small variations, forming a collection of total length N. Examples of such collections are version control data and genome sequences of individuals, where the differences can be expressed by lists of basic edit operations. This paper is devoted to studying ways to store massive se...
Document Listing on Repetitive Collections
Many document collections consist largely of repeated material, and several indexes have been designed to take advantage of this. There has been only preliminary work, however, on document retrieval for repetitive collections. In this paper we show how one of those indexes, the run-length compressed suffix array (RLCSA), can be extended to support document listing. In our experiments, our addit...
Indexing Highly Repetitive Collections
The need to index and search huge highly repetitive sequence collections is rapidly arising in various fields, including computational biology, software repositories, versioned collections, and others. In this short survey we briefly describe the progress made along three research lines to address the problem: compressed suffix arrays, grammar compressed indexes, and Lempel-Ziv compressed indexes.
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We int...
Storage and Retrieval of Highly Repetitive Sequence Collections
A repetitive sequence collection is a set of sequences which are small variations of each other. A prominent example are genome sequences of individuals of the same or close species, where the differences can be expressed by short lists of basic edit operations. Flexible and efficient data analysis on such a typically huge collection is plausible using suffix trees. However, the suffix tree occ...
Publication date: 2008